Windows Expert

Text File | 1993-03-22 | 29KB | 808 lines

Instant Index
Transform-Based Full Text Indexing and Search Software
Version 1.0 Documentation

Instant Index, Copyright (C) 1992, 1993 Theodore A. Holden, All Rights Reserved

LICENSE

Instant Index v1.0 is neither free software nor is it in the public domain. The software and its documentation (this file) are the property of the author and may not be sold without written permission from the author.

Instant Index v1.0 is distributed as shareware. This means that you are granted a limited license to use it for a period of 30 days. If you find it useful and decide to continue using it after the trial period, registration is required.

Registered individual users will be granted a just-like-a-book license, which means a registered version of the software can be used by more than one person and can be moved from one computer to another so long as there is NO POSSIBILITY of it being used by two different persons on two different computers at the same time, just as a book cannot be read by two different persons in two different places at the same time. This is mainly intended to allow the typical individual user to use the product on a computer at home and on a computer at the office.

Two individually licensed copies of Instant Index, of any version, with the same serial number, may not legally appear on more than one computer at any place of business, government agency, school, etc.

Version 2.0 of Instant Index is a commercial product and is not shareware.

Commercial site licenses for all versions of Instant Index are available at reasonable rates.

Instant Index Copyright 1992, 1993 Ted Holden

TERMS OF DISTRIBUTION:

Redistribution of version 1.0 of Instant Index must include the software, its documentation file, order form and all supplemental files as a single unit without any modification AND subject to the following conditions:

1. Any individual is welcome to make copies for his/her friends and/or colleagues if NO FEE is charged.

2.
Electronic bulletin boards, whether or not they charge their users a subscription fee, are welcome to post the program for downloading as long as they do not charge any fee specifically for the distribution of Instant Index.

3. Computer information services such as CompuServe (CIS), GEnie, etc., may post this software for their subscribers.

4. Non-commercial user groups and computer clubs may distribute the program to their members if the fee charged for the diskette containing Instant Index does not exceed $10.

5. Disk vendors approved by the Association of Shareware Professionals, or disk vendors who explain the concept of shareware in their ads that quote a price, may distribute the shareware version of Instant Index.

6. Persons or enterprises wishing to distribute Instant Index in combination with other hardware, software, books or materials must obtain proper licensing agreements from HT Enterprises.

DISCLAIMER OF WARRANTY

THIS SOFTWARE AND MANUAL ARE SUPPLIED "AS IS". THE AUTHOR HEREBY DISCLAIMS ALL WARRANTIES RELATING TO THIS SOFTWARE AND ITS DOCUMENTATION FILE, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO DAMAGE TO HARDWARE, SOFTWARE AND/OR DATA FROM USE OF THIS PRODUCT. IN NO EVENT WILL THE AUTHOR OF THIS SOFTWARE BE LIABLE TO YOU OR ANY OTHER PARTY FOR ANY DAMAGES.

THE WORST POSSIBLE CASE FOR SOFTWARE FAILURE, IN OUR VIEW, WOULD BE FOR THE COMPUTER INVOLVED, THE HOUSE OR BUILDING IN WHICH IT IS LOCATED, AND THE ENTIRE NEIGHBORHOOD CONTAINING THAT BUILDING TO BURN TO THE GROUND DUE TO SOME UNFORESEEN SOFTWARE BUG; EVEN IN THAT CASE, NEITHER THEODORE HOLDEN NOR HT ENTERPRISES WILL ACCEPT ANY LIABILITY.

HT Enterprises cannot and will not be liable for any special, incidental, consequential, indirect or similar damages due to loss of data or any other reason, even if HT Enterprises or an authorized HT Enterprises agent has been advised of the possibility of such damages.
In no event shall the liability for any damages ever exceed the price paid for the license to use the software, regardless of the form and/or extent of the claim. The user of this program bears all risk as to the quality and performance of the software.

YOUR USE OF THIS SOFTWARE INDICATES THAT YOU HAVE READ AND AGREE TO THESE AND OTHER TERMS INCLUDED IN THIS DOCUMENTATION FILE.

VERSIONS AVAILABLE

Version 1.0 of Instant Index (included) is a shareware version which handles one (presumed large) ASCII text file, with a .TXT suffix, at a time. This version is more than proof of concept; it should actually be of more value than the full multi-file version to certain groups of users, particularly CD-ROM vendors and others involved in distributing large ASCII text files. Naturally, licensing arrangements must be made with HT Enterprises and the author of this software for such use.

Version 2.0 uses the same indexing technology to index and search entire directories and all of the files in them. Text in ASCII files is shown via the text handling mechanism of Instant Index itself; application files are brought up either by this method or in the applications which created them. Provision is made for applications which do not use file extensions. This version serves the needs of the user who has lots of text in lots of files and needs to find, very quickly, the files which contain a certain text pattern or group of words.

WHAT IT IS

Instant Index requires a 386 or 486 computer and MS Windows 3.1. It does not yet run in any other environment.

Instant Index represents a software genre which most users will be less familiar with than they are with the usual spreadsheets and word processors. This genre is called full text search, and involves indexing large to gigantic bodies of text on disk in such a way as to allow large-scale and rapid searching for words, phrases, combinations of words in proximity, etc.
Large bodies of text are just now becoming increasingly common and available in DOS format, particularly with the proliferation of CD technology. A really good program for handling large bodies of text is clearly needed.

There are two reasons why the average PC user is not familiar with this software genre:

1. Until now, such software has been very expensive. License fees of $1000 to $20,000 for a single-user computer have been the norm.

2. Until now, such software has been very slow; recent articles in PC Week and InfoWorld describe leading products taking upwards of two hours to index text files ranging from 13 to 26 MB. The average PC user would have (justifiable) difficulties in dealing with this psychologically. Basically, anything which takes two hours or more to happen on a 486 isn't really a solution to anything; it's a problem.

The HT Enterprises Instant Index program solves both problems. It is priced well within the reach of the average PC user and is FAST. II can index a 100 MB file in under 20 minutes on a typical 33 MHz 486 PC running MS Windows. It is something like 100 times faster than the fastest products until now. We don't really know how large a file you could use with Instant Index on ordinary 386/486 PCs; we suspect it could handle files in the .6 GB to GB range.

Normal use for II would be to find a certain section of text and paste it into a word processing document in Ami Pro, WordPerfect, or some other full-function Windows word processor.

REGISTRATION FOR HT ENTERPRISES PROGRAMS

II version 1.0 is intended as a home product and also as a means for businesses, corporations and the like to evaluate the features and performance of the Instant Index concept. There is also a certain class of users which might find version 1 or some adaptation of it more useful than the full version (2), and for such applications, licensing arrangements must be made with HT Enterprises.
THE VAST BULK OF USERS WILL HAVE A GREAT DEAL MORE USE FOR VERSION 2.0, AND IT IS NOT TERRIBLY EXPENSIVE!

Good site license terms for Instant Index are available. No version of II may be used in businesses, organizations, corporations, schools, government agencies etc. for production work without proper licenses being in place.

Home computer users may use II version 1.0 for one month on a demo basis. Beyond that, however, registration is required for continued use of II. The included form should be used to register a copy of II.

Registered users of Instant Index (any version) receive technical support and news of upgrades and new products, which in the future will include other AI applications. If you haven't guessed already, II is an AI application.

.............................................................................

                            REGISTRATION FORM
                     For Individual Software Licenses

PROGRAM:                              # COPIES:     AMOUNT:

II Version 2.0 ($200 per copy)        _________     $______________
   Intro price good thru 5/30/93

II Version 1.0 ( $30 per copy)        _________     $______________

TOTAL. . . . . . . . . . . . . . . . . . . . . . .  $______________

PAYMENT BY: Check/Money Order No.__________ enclosed for $____________________

MAILING ADDRESS:

NAME______________________________________________________________
ADDRESS LINE 1____________________________________________________
ADDRESS LINE 2____________________________________________________
CITY/STATE/PROVINCE_______________________________________________
COUNTRY/POSTAL CODE_______________________________________________
HOME PHONE________________________________________________________
OFFICE PHONE______________________________________________________

SEND TO:  HT Enterprises
          8375 Leesburg Pike, Suite 422
          Vienna Va. 22182

Call HT Enterprises at (703) 760-9713 for site license pricing.

INSTANT INDEX
By HT Enterprises

ASSUMPTIONS

Instant Index is a piece of software designed for handling large to gigantic text data files.
Instant Index runs under Microsoft Windows 3.1 and assumes at least a 386/486 based computer with a minimum of 4 MB of memory and a mouse pointing device.

Instant Index assumes an ASCII text file with a .txt extension, and creates a corresponding .con (control) index file. Aside from the program itself, Instant Index must keep one of these index files in memory (Windows 3.1 swap space on disk is included as memory in this reckoning) while searching the .txt file. The index files are typically around 6% of the size of the original data file. This means that a 100 MB file could be searched easily enough on a 486 computer with 8 MB of RAM. 486 computers are now being configured with 64 MB of RAM; this means that the outer limit of size for text files for use with Instant Index should be around 800 MB or so.

Bottom line: Instant Index absorbs around 400K bytes when loaded with a minimal sized control file. You need that 400K plus enough space for your index file.

It is assumed that users are familiar with DOS files and directories, normal copy commands etc., and with the workings of MS Windows, ordinary file and font dialog boxes etc.

SETUP

You paid lots of money for Instant Index; therefore, it should be time-consuming and difficult to install on your computer, right? Sorry to disappoint you.

You'll find two executables on the distribution diskette: II.exe (the main program), and wtxt.exe, which is the program which creates indices. II.exe calls wtxt.exe with a WinExec call, which means that wtxt.exe has to be in a directory which is on your path. II.exe could be anywhere. You simply go through the MS Windows process for adding an executable to one of the normal program groups, which would usually be the Windows Applications group.

Copyright 1992 Ted Holden

I. What Instant Index is and isn't.

Instant Index is an awesomely fast system for indexing and searching large to gigantic text files.
It assumes a user has one or more ASCII text files with a .txt extension, and then creates matching .con (control) files for indexing. The text files may then be searched for words or combinations of words in settable proximity, and text may then be pasted into typical MS Windows word processing software using the Windows clipboard. Instant Index is single-purpose; it does one thing and does that one thing well.

II. Technical Basis

Typical text-search software generates tables of keywords which hash into tables of linked lists of sector locations for a data file. This methodology allows fast search once it is set up for a particular data file, but setting it up is very time consuming. Index files (the keyword tables, linked lists, etc.) tend to be not much smaller than the original data files, which can be a problem with very large files.

Instant Index, on the other hand, utilizes statistical methods and a variant of the Lawrence transform to achieve a very fast correlation between textual content and location, and produces index files which are typically 6 percent of the size of the original data file. This system is more malleable than the standard keyword hashing algorithms; a number of desirable functions, such as actual fuzzy searching on very large data sets, are natural fallouts of the technology. It is not easy to imagine fuzzy searching on a file too large for memory using keyword tables and hashing algorithms.

III. Speed and Power.

The standard test file which we've been working with at HTE is the King James Bible, about 4.6 MB of text, and Instant Index can index that in something like 30 seconds on a generic 33 MHz 486 with a 17 ms disk. This would allow a 100 MB file to be indexed for rapid search in under 15 minutes on a computer costing less than $2000.

This sort of power and speed gives a user options which he otherwise simply would not have in dealing with large text data sets.
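As a rough check on the figures quoted in section III, the sketch below extrapolates the King James Bible benchmark (about 4.6 MB in about 30 seconds) to a 100 MB file. It assumes that indexing time grows linearly with file size, which is an assumption made here for illustration and not something the manual states; the function name and its parameters are hypothetical.

```python
def estimated_index_minutes(file_mb, bench_mb=4.6, bench_seconds=30.0):
    """Extrapolate indexing time from the quoted benchmark.

    Assumes time scales linearly with file size (an illustrative
    assumption, not a documented property of Instant Index).
    """
    return file_mb / bench_mb * bench_seconds / 60.0


minutes = estimated_index_minutes(100)
print(f"{minutes:.1f} minutes")  # about 10.9 minutes
```

Under that assumption the estimate comes to roughly 11 minutes, which is consistent with the "under 15 minutes" figure given above.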
Text being scanned in or piling in from a news feed, for instance, can now be dealt with in rapid and easy fashion. The thought of re-indexing a large file which has changed ceases to cause the fear and panic which it formerly did.

IV. Characteristics of Instant Index.

In contrast to normal software, Instant Index has some of the same characteristics, the same strengths and, occasionally, a few of the same kinds of quirks as the human mind and human memory.

There are two pieces to the Instant Index search mechanism: the transform-based initial search engine, the action of which is instantaneous in all cases, and a "grep"-like secondary or clean-up function. Normal text search software gets slow when given long search strings; Instant Index gets faster. The more specific a search criterion you give it, i.e. the longer the search string, the closer you come to having the math-transform first-stage system do all of the work, and the faster the whole process becomes.

For instance, the character string "lions" occurs in "millions" and a number of other words; the fragment "ions" occurs even more often. Therefore a search of the Bible (our standard test material) for "lions" returns several hundred hits, too many to serve any useful purpose. Adding the string "Daniel", or "mastery", however, narrows the search down to a few instances in the book of Daniel, the response being nearly instantaneous.

Typical search phrases such as "behold, a pale horse" or "fishers of men" are plenty long enough in most instances to return the one or two hits expected and nothing more. Words such as "the" or "and" add nothing to a typical search for obvious reasons, and may be omitted. Any word with an unusual combination of letters, such as "archeologist" or "paleontologist", or likewise any word with four or more syllables, will often work well by itself as a search criterion.
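The narrowing effect described above can be illustrated with a small sketch. It splits the text into fixed-size sectors (the manual gives 2048 bytes as the default section size) and reports the sectors that contain every word of the search string, matching case-insensitively and by substring, so that "lions" also matches "millions". This is a plain linear scan written for illustration only; it is not the transform-based engine of Instant Index, and the function name is hypothetical.

```python
def find_hit_sectors(text, query, sector_size=2048):
    """Return the indices of sectors containing every word in query.

    Plain substring scan for illustration; sector_size counts
    characters, which equals bytes for ASCII text.
    """
    words = query.lower().split()
    hits = []
    for start in range(0, len(text), sector_size):
        sector = text[start:start + sector_size].lower()
        if all(word in sector for word in words):
            hits.append(start // sector_size)
    return hits


# Two sectors: the first mentions "millions", the second mentions lions.
sample = ("There are millions of stars. " * 3).ljust(2048)
sample += "Daniel was cast into the den of lions."

print(find_hit_sectors(sample, "lions"))         # [0, 1]
print(find_hit_sectors(sample, "lions Daniel"))  # [1]
```

Searching for "lions" alone reports both sectors because of the substring match on "millions"; adding "Daniel" leaves only the sector of interest.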
When a search turns up too many hits to be useful, you can always add another word to the end of the search string and try again. Adding words always narrows the search down and speeds things up.

At times, you have to be a little bit smart about how you use any tool, and II is no exception. For instance, searching Shakespeare's works for a famous phrase, such as Hamlet's "To be or not to be, that is the question", turns out to be very slow on II. The truth of the matter is that the only word in that whole phrase with any power of discrimination within the context of English text is the word "question". A search for "Whether 'tis nobler in the mind" turns out to be quite fast and is, for fairly obvious reasons, a better use of the tool.

V. Verify and Redline.

The Verify and Redline functions (menu keys) affect the actions of a search.

Verify is the "grep"-like, or ordinary, search function which cleans up after the action of the statistical engine of Instant Index. Any time a search returns more slowly than instantaneously, Verify is at work. Verify removes false hits, that is, the tiny amount of statistical aliasing produced by the statistical engine of Instant Index. If you turn Verify off, II (Instant Index) becomes instantaneous in all cases, but you'll find yourself having to give longer and more precise search strings to cut the number of hits down to an acceptable level.

The normal situation in which you turn Verify off is for fuzzy matching applications in which you assume data produced from scanning and OCR is less than 100% good on spelling. In that case, Verify would always fail upon encountering a misspelled word, and would prevent the entire process from working.

Redline highlights the section of text which you are looking for when a hit is returned to the screen. Redline has two modes: 2-Lines and All. 2-Lines highlights the text you are searching for only when it occurs within two successive lines of text, which is normal for a phrase.
The All option causes highlighting to occur for any line containing any word within the search criteria. When using this, you must leave out words such as "and", "the", "a" etc. or every line on the page will be lit up. The All option is good when searching for a few key words which may be assumed to lie in close proximity, but not necessarily on the same one or two lines, in a particular section of a large data file.

VI. Motion Control: Next Hit, Previous Hit, Forward, Back, scrollbar

The parameters dialog box for the indexing function of Instant Index allows you to set a data file sector size (not the same notion as disk sector size) for searching. Searches then seek sectors which contain all of the words in the search criteria. If, for instance, five such sectors are found, the search will come back with a message box claiming <5 HITS!>. The first hit sector will be put up on the screen, or at least as much of that sector as the screen will hold.

Next Hit and Previous Hit move to the next or previous hit sector. Forward and Back position the file forward or back 512 bytes at a time.

The scroll bar included in Instant Index positions the view screen within the text file and indicates, more or less, where in the text file a search string has been found.

VII. Minimizing aliasing and false hits.

The advantages of Instant Index in comparison with standard text search software are huge. The only very minor down side is aliasing, or false hits, which comes with the territory with a statistical methodology, and this can easily be controlled. The statistical back-end engine returns all sectors in which all words in the search criteria occur. Making the search string longer and more precise always narrows the search down and speeds up the process, since it cuts down the amount of work required of the Verify function.

VIII. Double Hits.

Instant Index occasionally returns a double hit, i.e. returns the same hit twice.
This is a very minor nuisance which is unavoidable in the design of such a package. It is a by-product of the system for ensuring that search strings which span two file sectors still get reported without losing performance or increasing index file size.

IX. Open and Fonts

The Open function assumes that a .txt file and a corresponding .con file exist in a directory somewhere, i.e. that you have availed yourself of the Index function to create a .con file for your .txt file. Other than that, Open is just an ordinary Borland FileDialog box.

Fonts is a fairly standard font select dialog box.

If you haven't seen these before, clicking on ".." is equivalent to "CD .." under DOS or UNIX. There's nothing else mystical about them.

X. Redlining and Copy/Paste

Aside from lines which get redlined by the Verify function, you can hold the left mouse button down and redline any lines which appear on the screen. Clicking the right mouse button undoes any redlining.

The Copy/Paste key puts any redlined text into the MS Windows clipboard edit buffer, from which it may be retrieved using the "Paste" feature of any full-function MS Windows word processor.

This is the normal use of Instant Index. Basically, you find something you want in a huge text file, then you paste it into a word processor and do your own thing with it.

XI. Fuzzy logic and wildcard-like searching

Including the fragment "direct" in the search criteria will return "director", "direction", "directing" etc., i.e. wildcard searching is achieved by simple shortening or omission.

Fuzzy searching is another topic. We believe we have done the best job which is doable with fuzzy searching with Instant Index, nor is it obvious that fuzzy searching could be achieved at all for a file too large for memory using traditional methods.

Fuzzy searching means being able to find text which might be misspelled. Bottom line is that the best you could ever hope for is finding some percentage over 50% of such criteria.
We believe we're way over 50%, but anything more precise than that would be a wild guess.

Fuzzy logic is an overused concept, like the word "turbo". Your best procedure for dealing with scanned text or other text prone to misspellings, if you have this option, would be to run the text through some serious spell-checker and then use Instant Index on it. The guys who write spell-checkers are like us; they're good at what they do.

For a very large scanned text file, this may not be possible. Read through the section on the parameters dialog box for the Index function so that you know what goes into preparing a file for fuzzy searching. Basically, when you create an index for a file which you plan to do fuzzy searching on, you want to set the Search Depth parameter as high as possible, allowing for the fact that the index file must be kept in memory.

The Fuzzy Value dialog box allows you to set values of 0 (no fuzziness), 1 (one letter missed in a search criterion), or 2 (two letters missed in a search criterion). Beyond that only prayer would help.

For fuzzy searching, set Verify to OFF and Redline to ALL. Fuzzy searching raises the rate of statistical aliasing. You have to know something about what you're looking for. Basically, you just keep adding words to the end of the search string (in the Search dialog box) until the number of hits is down to something acceptable.

XII. Search.

The Search function brings up an ordinary edit dialog box in which you type a couple of words or a phrase to search for. The text you typed in remains after the search. You can add a word or two (to narrow down the search) simply by adding after the end of a string already in the dialog box.

XIII. Indexing.

The Index function executes the wtxt.exe program mentioned in the section on setup. Wtxt.exe is another MS Windows program, and may be thought of as simply a non-modal dialog box or extraneous window; that's precisely what it appears as. It has two functions: Create and Parameters.
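To make the Fuzzy Value settings from section XI more concrete, here is a minimal sketch of one possible reading of "letters missed": a word matches if some stretch of the text of the same length differs from it in at most that many character positions. Both this interpretation and the function name are assumptions made for illustration; the manual does not say how Instant Index counts missed letters, and its statistical engine works differently from this linear scan.

```python
def fuzzy_contains(text, word, fuzzy_value=1):
    """True if some window of text differs from word in at most
    fuzzy_value positions (substitutions only, case-insensitive).

    One hypothetical reading of the Fuzzy Value setting.
    """
    text, word = text.lower(), word.lower()
    n = len(word)
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        mismatches = sum(1 for a, b in zip(window, word) if a != b)
        if mismatches <= fuzzy_value:
            return True
    return False


# OCR has misread the letter 'l' as the digit '1'.
scanned = "The paleonto1ogist examined the bones."

print(fuzzy_contains(scanned, "paleontologist", 0))  # False
print(fuzzy_contains(scanned, "paleontologist", 1))  # True
```

With a value of 0 the misread word is not found; allowing one wrong letter recovers it, at the cost of admitting more false hits on shorter words.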
XIV. Create.

The Create function is an ordinary Borland file dialog box. You use it to select a file with a .txt extension and create a corresponding index (.con) file for it. After that, you can either leave wtxt.exe on the screen, possibly to create several .con files in one sitting, or close it.

XV. Parameters.

The Parameters function in wtxt.exe allows you to set a number of parameters which figure into creating index files:

A. Alphabet: The upper and lower case characters of the alphabet being used for searching. This could be anything for which an MS Windows font exists. There is no reason why German or Russian text or even something as strange as French text could not be searched. Instant Index is not case sensitive. Be sure that upper and lower cases include equal numbers of characters. We assume a phonetic alphabet, left-to-right, high-to-low, all of those sorts of things.

A plug for another of our products may in fact be in order here. We have one of the most interesting Russian font sets in existence, including standard Cyrillic, a fairytale font, and a Russian version of a Cloister font in ATM format. Call for info.

B. Other Characters: Other characters (than the alphabet) to include in search strings. Typically, just the numbers 0 - 9. For instance, for Bible searching, you might also include a colon ( : ) to allow you to search for such things as "Gen 1:7". Instant Index allows a total of 60 characters all told, counting each upper/lower-case pair as one character.

C. Search Density. Basically, this is just the size of the index file. Raising this value by one doubles the size of the index file from the previous value. The up side is that this reduces statistical aliasing. This becomes helpful for fuzzy searching. Assuming somebody doing fuzzy searching has the memory to deal with it, the larger index file is better.

D. Text File Section Size. This is the size (in bytes) of a section within the text file to serve as a base of reference.
Instant Index thinks of the text file as consisting of sectors of this unit of size. The back-end statistical search engine returns sectors within which all words of the search criteria are found. 2048 bytes is the default. As of now, we can think of no real reason for having the sector size smaller in the normal case. For fuzzy logic searching with Verify off, a lower value would let you see an entire file sector on one screen, which might be helpful. Halving the section size doubles the size of the index file.

E. Anti Aliasing. This one is a no-brainer. Anti-aliasing is set for the English language at present. For English text, leave it on. For other language text, turn it off. The feature is worth having, as it generally reduces the incidence of false hits and speeds up the program (reduces the job of the Verify function). We would require 5 - 10 MB of text in another language, along with an appropriate MS Windows font, to set up a version with anti-aliasing for another language.

The unique anti-aliasing feature of Instant Index is the chief point which differentiates this program from other attempts to use the Lawrence transform, and what allows the program to use an index file 6% of the size of the data file rather than the more usual 20 - 50%.
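The size rules given under Search Density and Text File Section Size can be combined into a rough memory estimate. The sketch below assumes that the typical 6% index ratio applies at the default settings, that Search Density is counted in steps above the default, and that the program itself takes about 0.4 MB (the "around 400K bytes" quoted under ASSUMPTIONS). Those baselines, and the function names, are assumptions made here for illustration; the manual does not state which settings the 6% figure corresponds to.

```python
def estimated_index_mb(text_mb, density_steps=0, section_size=2048,
                       base_ratio=0.06, default_section_size=2048):
    """Estimate index file size in MB.

    Each Search Density step doubles the index; halving the section
    size doubles it too. base_ratio at the defaults is an assumption.
    """
    scale = (2 ** density_steps) * (default_section_size / section_size)
    return text_mb * base_ratio * scale


def estimated_memory_mb(text_mb, program_mb=0.4, **settings):
    """Program footprint plus the in-memory index, in MB."""
    return program_mb + estimated_index_mb(text_mb, **settings)


print(f"{estimated_index_mb(100):.1f} MB index at defaults")
print(f"{estimated_index_mb(100, density_steps=1):.1f} MB with density +1")
print(f"{estimated_index_mb(100, section_size=1024):.1f} MB with 1024-byte sections")
print(f"{estimated_memory_mb(100):.1f} MB total memory at defaults")
```

For a 100 MB text file this gives about 6 MB of index at the defaults and about 12 MB after either one Search Density step or a halved section size, which is the trade-off to weigh against available memory when preparing a file for fuzzy searching.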